Social Pulse: Cross-Platform Viral Trends Analysis
Author: Alexia Burford
Course: MATH 014 – Introduction to Data Science
Due Date: 2025-12-03
Notebook overview
- Data loading & understanding
- Cleaning & transformations (documented)
- Exploratory Data Analysis (summary stats, counts)
- Visualizations (required six + heatmap)
- Optional research questions (3)
- Save cleaned CSV, export HTML, create PPTX of figures
In [22]:
# Cell 0 — Install required packages (run this if packages missing)
import sys
import subprocess
def pip_install(packages):
    subprocess.check_call([sys.executable, "-m", "pip", "install"] + packages)
required = ["pandas","numpy","matplotlib","seaborn","plotly","python-pptx","openpyxl"]
# Only install if missing (safe guard)
try:
    import pandas, numpy, matplotlib, seaborn, plotly, pptx
except Exception:
    pip_install(required)
In [23]:
# Cell 1 — Imports & settings
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from datetime import datetime
from pptx import Presentation
from pptx.util import Inches
import warnings
warnings.filterwarnings('ignore')
# Plot aesthetics
sns.set(style="whitegrid", font_scale=1.05)
plt.rcParams['figure.figsize'] = (10,6)
In [26]:
# Cell 2 — Load dataset
# Place dataset CSV in same folder and name it 'cleaned_viral_social_media_trends.csv'.
DATA_FILENAME = "cleaned_viral_social_media_trends.csv"
if not os.path.exists(DATA_FILENAME):
    raise FileNotFoundError(
        f"Dataset file '{DATA_FILENAME}' not found in working directory. "
        "Please download the dataset and save it as 'cleaned_viral_social_media_trends.csv' in this folder."
    )
df = pd.read_csv(DATA_FILENAME, low_memory=False)
df_original = df.copy() # keep a raw copy for reference
df.head(5)
Out[26]:
| | Post_ID | Post_Date | Platform | Hashtag | Content_Type | Region | Views | Likes | Shares | Comments | Engagement_Level |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Post_1 | 2022-01-13 | TikTok | #Challenge | Video | UK | 4163464 | 339431 | 53135 | 19346 | High |
| 1 | Post_2 | 2022-05-13 | | #Education | Shorts | India | 4155940 | 215240 | 65860 | 27239 | Medium |
| 2 | Post_3 | 2022-01-07 | | #Challenge | Video | Brazil | 3666211 | 327143 | 39423 | 36223 | Medium |
| 3 | Post_4 | 2022-12-05 | YouTube | #Education | Shorts | Australia | 917951 | 127125 | 11687 | 36806 | Low |
| 4 | Post_5 | 2023-03-23 | TikTok | #Dance | Post | Brazil | 64866 | 171361 | 69581 | 6376 | Medium |
In [27]:
# Data understanding: quick look at datatypes, columns, null counts
print("Shape:", df.shape)
display(df.info())
display(df.describe(include='all').T)
print("\nNull counts per column:")
display(df.isna().sum().sort_values(ascending=False).head(20))
Shape: (5000, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Post_ID           5000 non-null   object
 1   Post_Date         5000 non-null   object
 2   Platform          5000 non-null   object
 3   Hashtag           5000 non-null   object
 4   Content_Type      5000 non-null   object
 5   Region            5000 non-null   object
 6   Views             5000 non-null   int64
 7   Likes             5000 non-null   int64
 8   Shares            5000 non-null   int64
 9   Comments          5000 non-null   int64
 10  Engagement_Level  5000 non-null   object
dtypes: int64(4), object(7)
memory usage: 429.8+ KB
None
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Post_ID | 5000 | 5000 | Post_1 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Post_Date | 5000 | 729 | 2023-10-16 | 17 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Platform | 5000 | 4 | YouTube | 1324 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Hashtag | 5000 | 10 | #Fitness | 536 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Content_Type | 5000 | 6 | Live Stream | 855 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Region | 5000 | 8 | USA | 677 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Views | 5000.0 | NaN | NaN | NaN | 2494066.444 | 1459489.824435 | 1266.0 | 1186207.25 | 2497373.0 | 3759781.0 | 4999430.0 |
| Likes | 5000.0 | NaN | NaN | NaN | 251475.0298 | 144349.583384 | 490.0 | 126892.25 | 249443.0 | 373970.75 | 499922.0 |
| Shares | 5000.0 | NaN | NaN | NaN | 50519.562 | 29066.362671 | 52.0 | 25029.0 | 50839.5 | 75774.25 | 99978.0 |
| Comments | 5000.0 | NaN | NaN | NaN | 24888.3938 | 14284.504319 | 18.0 | 12305.25 | 25004.0 | 37072.75 | 49993.0 |
| Engagement_Level | 5000 | 3 | Low | 1729 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Null counts per column:
Post_ID             0
Post_Date           0
Platform            0
Hashtag             0
Content_Type        0
Region              0
Views               0
Likes               0
Shares              0
Comments            0
Engagement_Level    0
dtype: int64
In [28]:
# Cell 3 — Standardize column names (lowercase, underscore)
df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
df.columns
Out[28]:
Index(['post_id', 'post_date', 'platform', 'hashtag', 'content_type', 'region',
'views', 'likes', 'shares', 'comments', 'engagement_level'],
dtype='object')
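The normalization rule in Cell 3 (strip whitespace, lowercase, spaces to underscores) can be checked on a few sample headers:

```python
# Sketch of Cell 3's column-name normalization on stand-in header strings
cols = ['Post_ID', 'Post Date', ' Engagement_Level ']
normalized = [c.strip().lower().replace(' ', '_') for c in cols]
print(normalized)  # ['post_id', 'post_date', 'engagement_level']
```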
In [29]:
# Cell 4 — Key columns check & typical issues
# Check for the columns this notebook expects. Note this dataset uses
# 'hashtag' (singular) and 'post_date', so 'hashtags' and 'date' report False
# and the later hashtag/date analyses are skipped.
for col in ['content_type','platform','views','likes','shares','comments','hashtags','date','region']:
    print(col, "in df ->", col in df.columns)
content_type in df -> True
platform in df -> True
views in df -> True
likes in df -> True
shares in df -> True
comments in df -> True
hashtags in df -> False
date in df -> False
region in df -> True
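Since the checks above show this dataset uses `hashtag` (singular) and `post_date` rather than the expected `hashtags` and `date`, a small rename step (a hedged sketch, assuming these are the only mismatches) would let the later hashtag and date analyses run. Illustrated on a stand-in frame:

```python
import pandas as pd

# Hypothetical mapping from this dataset's column names to the names the
# notebook's guards expect; only applied when the source column exists.
rename_map = {'hashtag': 'hashtags', 'post_date': 'date'}

df_demo = pd.DataFrame({'hashtag': ['#dance'], 'post_date': ['2022-01-13'],
                        'platform': ['tiktok']})
df_demo = df_demo.rename(columns={k: v for k, v in rename_map.items()
                                  if k in df_demo.columns})
print(df_demo.columns.tolist())  # ['hashtags', 'date', 'platform']
```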
In [30]:
# Cell 5 — Cleaning steps (documented and executed):
# 1) Lowercase text categories for consistency (content_type, platform, hashtags)
# 2) Remove duplicate rows
# 3) Convert numeric fields to numeric, coerce errors, and fill/impute sensible defaults
# 4) Parse date column if present
# 5) Handle missing values for content_type/platform/engagement metrics
# 6) Derive new metrics: total_engagement, engagement_rate
# 1) lowercase string fields
text_cols = df.select_dtypes(include='object').columns.tolist()
for c in text_cols:
    df[c] = df[c].astype(str).str.strip().replace({'nan': '', 'None': ''})
    df[c] = df[c].str.lower()
# 2) remove duplicates
before = df.shape[0]
df = df.drop_duplicates()
after = df.shape[0]
print(f"Removed {before-after} duplicate rows")
# 3) numeric conversion
for col in ['views','likes','shares','comments']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')
# 4) parse date (if exists)
if 'date' in df.columns:
    # pandas infers common formats by default; infer_datetime_format is deprecated
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
# 5) handle missing values
# Strategy:
# - If content_type or platform missing: mark as 'unknown'
# - For engagement metrics: if missing, assume 0 (safer) OR impute median if high missingness
eng_cols = [c for c in ['views','likes','shares','comments'] if c in df.columns]
for c in ['content_type','platform','region']:
    if c in df.columns:
        df[c] = df[c].replace({'': np.nan})
        df[c] = df[c].fillna('unknown')
# If engagement metrics mostly present, fill missing with 0 (reasonable when missing means not recorded)
for c in eng_cols:
    missing = df[c].isna().sum()
    total = df.shape[0]
    pct_missing = missing / total
    print(f"{c}: {missing} missing ({pct_missing:.2%})")
    # If fewer than 20% missing -> fill 0; else fill median and flag
    if pct_missing <= 0.20:
        df[c] = df[c].fillna(0)
    else:
        df[c + '_imputed'] = df[c].isna()
        df[c] = df[c].fillna(df[c].median())
# 6) derive metrics
df['total_engagement'] = 0
for c in eng_cols:
    df['total_engagement'] += df[c].fillna(0)
# engagement rate: total_engagement / views (avoid divide by 0).
# Note: eng_cols includes 'views', so engagement_rate always exceeds 1 whenever views > 0.
df['engagement_rate'] = np.where(df['views'] > 0, df['total_engagement'] / df['views'], np.nan)
# Classify engagement levels by engagement_rate terciles.
# Note: this overwrites the dataset's original engagement_level labels.
df['engagement_level'] = pd.qcut(df['engagement_rate'].fillna(0), q=3, labels=['low', 'medium', 'high'])
# show head after cleaning
df.head(5)
Removed 0 duplicate rows
views: 0 missing (0.00%)
likes: 0 missing (0.00%)
shares: 0 missing (0.00%)
comments: 0 missing (0.00%)
Out[30]:
| | post_id | post_date | platform | hashtag | content_type | region | views | likes | shares | comments | engagement_level | total_engagement | engagement_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | post_1 | 2022-01-13 | tiktok | #challenge | video | uk | 4163464 | 339431 | 53135 | 19346 | medium | 4575376 | 1.098935 |
| 1 | post_2 | 2022-05-13 | | #education | shorts | india | 4155940 | 215240 | 65860 | 27239 | low | 4464279 | 1.074192 |
| 2 | post_3 | 2022-01-07 | | #challenge | video | brazil | 3666211 | 327143 | 39423 | 36223 | medium | 4069000 | 1.109865 |
| 3 | post_4 | 2022-12-05 | youtube | #education | shorts | australia | 917951 | 127125 | 11687 | 36806 | medium | 1093569 | 1.191315 |
| 4 | post_5 | 2023-03-23 | tiktok | #dance | post | brazil | 64866 | 171361 | 69581 | 6376 | high | 312184 | 4.812752 |
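As a sanity check, the derived columns for the first cleaned row can be reproduced by hand. Because total_engagement sums views together with likes, shares, and comments, engagement_rate always exceeds 1 whenever views are positive:

```python
# Values from row 0 of the cleaned table above
views, likes, shares, comments = 4163464, 339431, 53135, 19346

total_engagement = views + likes + shares + comments  # views counted in the sum
engagement_rate = total_engagement / views            # hence always > 1 here
print(total_engagement)           # 4575376
print(round(engagement_rate, 6))  # 1.098935
```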
In [31]:
# Save cleaned dataset
CLEANED_FILENAME = "viral_social_media_cleaned.csv"
df.to_csv(CLEANED_FILENAME, index=False)
print("Saved cleaned dataset to", CLEANED_FILENAME)
Saved cleaned dataset to viral_social_media_cleaned.csv
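The saved CSV can be verified with a quick in-memory round trip; `to_csv(index=False)` avoids writing a spurious index column that `read_csv` would otherwise load back. A minimal sketch on a stand-in frame:

```python
import io
import pandas as pd

# Tiny stand-in frame; a CSV round trip should reproduce it exactly
df_demo = pd.DataFrame({'post_id': ['post_1', 'post_2'],
                        'views': [4163464, 4155940]})
buf = io.StringIO()
df_demo.to_csv(buf, index=False)  # index=False: no extra unnamed column
buf.seek(0)
restored = pd.read_csv(buf)
print(restored.equals(df_demo))  # True
```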
In [32]:
# --- EXPLORATORY ANALYSIS ---
# Summary statistics for numeric columns
display(df[eng_cols + ['total_engagement','engagement_rate']].describe().T)
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| views | 5000.0 | 2.494066e+06 | 1.459490e+06 | 1266.000000 | 1.186207e+06 | 2.497373e+06 | 3.759781e+06 | 4.999430e+06 |
| likes | 5000.0 | 2.514750e+05 | 1.443496e+05 | 490.000000 | 1.268922e+05 | 2.494430e+05 | 3.739708e+05 | 4.999220e+05 |
| shares | 5000.0 | 5.051956e+04 | 2.906636e+04 | 52.000000 | 2.502900e+04 | 5.083950e+04 | 7.577425e+04 | 9.997800e+04 |
| comments | 5000.0 | 2.488839e+04 | 1.428450e+04 | 18.000000 | 1.230525e+04 | 2.500400e+04 | 3.707275e+04 | 4.999300e+04 |
| total_engagement | 5000.0 | 2.820949e+06 | 1.466766e+06 | 65360.000000 | 1.522523e+06 | 2.834790e+06 | 4.094814e+06 | 5.562501e+06 |
| engagement_rate | 5000.0 | 1.567599e+00 | 4.862140e+00 | 1.003001 | 1.077490e+00 | 1.129097e+00 | 1.270018e+00 | 2.827417e+02 |
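The low/medium/high labels created earlier come from pd.qcut, which assigns equal-count bins by quantile (unlike cut's equal-width bins). A small self-contained illustration:

```python
import pandas as pd

# Toy engagement rates; qcut splits them into equal-sized terciles,
# mirroring the low/medium/high labelling used in the cleaning cell
rates = pd.Series([1.01, 1.05, 1.08, 1.10, 1.13, 1.20, 1.30, 2.0, 5.0])
levels = pd.qcut(rates, q=3, labels=['low', 'medium', 'high'])
print(levels.tolist())
# ['low', 'low', 'low', 'medium', 'medium', 'medium', 'high', 'high', 'high']
```

Note how the extreme outlier (5.0) still lands in the same 'high' bin as 1.30; quantile binning is robust to outliers but hides their magnitude.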
In [33]:
# Unique counts
def unique_counts(cols):
    for c in cols:
        if c in df.columns:
            print(f"{c}: {df[c].nunique():,} unique values (example: {df[c].dropna().unique()[:5]})")
unique_counts(['hashtags','platform','content_type','region'])
platform: 4 unique values (example: ['tiktok' 'instagram' 'twitter' 'youtube'])
content_type: 6 unique values (example: ['video' 'shorts' 'post' 'tweet' 'live stream'])
region: 8 unique values (example: ['uk' 'india' 'brazil' 'australia' 'japan'])
In [34]:
# Helper to save figures
FIG_DIR = "figs"
os.makedirs(FIG_DIR, exist_ok=True)
def save_fig(name):
    path = os.path.join(FIG_DIR, name)
    plt.tight_layout()
    plt.savefig(path, dpi=150)
    print("Saved:", path)
In [35]:
# 1) Top 10 Hashtags by Frequency
# We assume 'hashtags' column may contain multiple hashtags separated by spaces/commas.
if 'hashtags' in df.columns:
    # explode hashtags
    def split_hashtags(s):
        # common separators: space, comma, ';'
        s = (s or "").strip()
        if s == "" or s.lower() in ['none', 'nan', '[]']:
            return []
        # remove surrounding brackets if present
        s = s.strip('[]()')
        for sep in [',', ';', '|', '/']:
            s = s.replace(sep, ' ')
        parts = [h.strip('#').strip() for h in s.split() if h.strip()]
        return parts
    hs = df['hashtags'].apply(split_hashtags)
    hs_exploded = hs.explode().dropna()
    top_hashtags = hs_exploded.value_counts().head(10)
    plt.figure(figsize=(10,6))
    sns.barplot(x=top_hashtags.values, y=top_hashtags.index)
    plt.title("Top 10 Hashtags by Frequency")
    plt.xlabel("Frequency")
    plt.ylabel("Hashtag")
    save_fig("top10_hashtags.png")
    plt.show()
    print("Interpretation: These hashtags appear most often; frequent appearance suggests topical/viral themes.")
else:
    print("No 'hashtags' column found.")
No 'hashtags' column found.
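The split_hashtags parser above never ran here because the column is named `hashtag` (singular), but its behavior can still be exercised standalone. A self-contained copy of the notebook's function on sample strings:

```python
def split_hashtags(s):
    # Copy of the notebook's parser: drops empty/placeholder values, strips
    # surrounding brackets, normalizes separators to spaces, removes '#'
    s = (s or "").strip()
    if s == "" or s.lower() in ['none', 'nan', '[]']:
        return []
    s = s.strip('[]()')
    for sep in [',', ';', '|', '/']:
        s = s.replace(sep, ' ')
    return [h.strip('#').strip() for h in s.split() if h.strip()]

print(split_hashtags('#dance, #challenge'))    # ['dance', 'challenge']
print(split_hashtags('[#fitness;#education]')) # ['fitness', 'education']
print(split_hashtags('nan'))                   # []
```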
In [36]:
# 2) Engagement Metrics by Platform (bar plot)
if 'platform' in df.columns:
    platform_group = df.groupby('platform')[eng_cols].median().sort_values('views', ascending=False).head(10)
    platform_group.plot(kind='bar', figsize=(12,6))
    plt.title("Median Engagement Metrics by Platform (Top 10 platforms by views)")
    plt.ylabel("Median count")
    save_fig("engagement_by_platform.png")
    plt.show()
    print("Interpretation: Compare likes/shares/comments across platforms to identify which platforms generate the most interactions.")
else:
    print("No 'platform' column found.")
Saved: figs/engagement_by_platform.png
Interpretation: Compare likes/shares/comments across platforms to identify which platforms generate the most interactions.
In [37]:
# 3) Views Distribution by Content Type (boxplot)
if 'content_type' in df.columns and 'views' in df.columns:
    plt.figure(figsize=(12,6))
    sns.boxplot(x='content_type', y='views', data=df[df['content_type'] != 'unknown'])
    plt.xticks(rotation=45)
    plt.title("Views Distribution by Content Type")
    plt.ylabel("Views")
    plt.xlabel("Content Type")
    save_fig("views_by_content_type.png")
    plt.show()
    print("Interpretation: Boxplots highlight which content types have high median views and which have more outliers (viral spikes).")
else:
    print("Missing content_type or views column.")
Saved: figs/views_by_content_type.png
Interpretation: Boxplots highlight which content types have high median views and which have more outliers (viral spikes).
In [38]:
# 4) Engagement Level Distribution (countplot)
if 'engagement_level' in df.columns:
    plt.figure(figsize=(8,5))
    sns.countplot(x='engagement_level', data=df, order=['low','medium','high'])
    plt.title("Engagement Level Distribution (Low, Medium, High)")
    plt.xlabel("Engagement Level")
    plt.ylabel("Number of Posts")
    save_fig("engagement_level_dist.png")
    plt.show()
    print("Interpretation: Distribution across engagement levels indicates skew toward lower/higher engagement.")
else:
    print("No engagement_level column.")
Saved: figs/engagement_level_dist.png
Interpretation: Distribution across engagement levels indicates skew toward lower/higher engagement.
In [39]:
# 5) Average Views by Region (choropleth map with bar-plot fallback)
# We'll attempt a choropleth if 'region' maps to country names or continents.
if 'region' in df.columns and 'views' in df.columns:
    # aggregate by region
    region_views = df.groupby('region')['views'].mean().reset_index().sort_values('views', ascending=False)
    # Attempt to map region as country names; if not, show bar.
    # Note: locationmode='country names' matches full English country names,
    # so lowercased labels like 'uk' or 'usa' may silently fail to render.
    try:
        fig = px.choropleth(region_views, locations='region', locationmode='country names',
                            color='views', hover_name='region',
                            title='Average Views by Region (countries)')
        # saving a static image via write_image would require kaleido; save an HTML view instead
        html_path = os.path.join(FIG_DIR, "avg_views_by_region_map.html")
        fig.write_html(html_path)
        print("Saved interactive map to", html_path)
        fig.show()
        print("Interpretation: Map shows spatial differences in average views by country/region.")
    except Exception as e:
        print("Could not draw global choropleth:", e)
        # fallback: bar plot
        top_regions = region_views.head(12)
        plt.figure(figsize=(12,6))
        sns.barplot(x='views', y='region', data=top_regions)
        plt.title("Top Regions by Average Views (fallback bar plot)")
        save_fig("avg_views_by_region_bar.png")
        plt.show()
Saved interactive map to figs/avg_views_by_region_map.html
Interpretation: Map shows spatial differences in average views by country/region.
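Because the cleaning step lowercased all text, region labels like 'uk' or 'usa' generally will not match plotly's `locationmode='country names'` lookup, so the saved map may render without data. A pre-mapping step (a sketch; the mapping values are assumed from the region list shown earlier) would fix the labels before calling px.choropleth:

```python
# Assumed mapping from this dataset's lowercased region labels to the full
# English country names that plotly's 'country names' locationmode matches
region_to_country = {
    'uk': 'United Kingdom', 'usa': 'United States',
    'india': 'India', 'brazil': 'Brazil',
    'australia': 'Australia', 'japan': 'Japan',
}

regions = ['uk', 'usa', 'japan']
# Fall back to title-casing for any label not in the map
mapped = [region_to_country.get(r, r.title()) for r in regions]
print(mapped)  # ['United Kingdom', 'United States', 'Japan']
```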
In [40]:
# 6) Heatmap of correlations among numeric metrics
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
corr = df[numeric_cols].corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm', square=True, cbar_kws={'shrink':0.5})
plt.title("Correlation Heatmap (numeric features)")
save_fig("correlation_heatmap.png")
plt.show()
print("Interpretation: Look for strong correlations (e.g., shares vs views) which suggest relationships between metrics.")
Saved: figs/correlation_heatmap.png
Interpretation: Look for strong correlations (e.g., shares vs views) which suggest relationships between metrics.
In [41]:
# Quick aggregated insights:
# Platforms driving most engagement (by median total_engagement)
if 'platform' in df.columns:
    plat_eng = df.groupby('platform')['total_engagement'].median().sort_values(ascending=False).head(5)
    print("Top platforms by median total engagement:\n", plat_eng)
    print("\nShort summary: Platforms at top show higher median engagement; consider prioritizing content for these.")
else:
    print("No platform column to summarize.")
# Consistently viral hashtags (top 5 by frequency)
if 'hashtags' in df.columns:
    print("\nTop 5 hashtags by count:\n", top_hashtags.head(5))
# Best content formats (by median views)
if 'content_type' in df.columns:
    ct_views = df.groupby('content_type')['views'].median().sort_values(ascending=False).head(5)
    print("\nTop content types by median views:\n", ct_views)
else:
    print("No content_type column.")
Top platforms by median total engagement:
platform
youtube      2900990.5
tiktok       2844541.0
twitter      2839517.0
instagram    2735906.5
Name: total_engagement, dtype: float64

Short summary: Platforms at top show higher median engagement; consider prioritizing content for these.

Top content types by median views:
content_type
reel           2531742.0
tweet          2517517.0
video          2509757.5
post           2482008.0
live stream    2476219.0
Name: views, dtype: float64
In [42]:
# 1) Do short videos outperform longer posts in engagement?
# Requires duration column; if absent, attempt to infer from content_type or 'duration' column
if 'duration' in df.columns:
    df['duration'] = pd.to_numeric(df['duration'], errors='coerce')
    df['short_long'] = np.where(df['duration'] <= 30, 'short', 'long')  # 30-second threshold example
    print(df.groupby('short_long')['engagement_rate'].median())
else:
    print("No 'duration' column. Skipping short vs long analysis.")
# 2) Are there hashtags that work better on specific platforms?
# compute top hashtags per platform
if 'hashtags' in df.columns and 'platform' in df.columns:
    # create hashtag-platform pairs with average engagement_rate
    tmp = pd.DataFrame({'hashtags': hs, 'platform': df['platform']})
    tmp = tmp.explode('hashtags').dropna()
    hp = tmp.groupby(['platform', 'hashtags']).size().reset_index(name='count')
    # join average engagement_rate; merge only the rate column to avoid a
    # duplicate 'platform' column with _x/_y suffixes breaking the groupby
    tmp2 = tmp.merge(df[['engagement_rate']], left_index=True, right_index=True, how='left')
    hp2 = tmp2.groupby(['platform', 'hashtags'])['engagement_rate'].mean().reset_index().sort_values(['platform', 'engagement_rate'], ascending=[True, False])
    # show top 3 hashtags per platform
    top_hp = hp2.groupby('platform').head(3)
    display(top_hp)
else:
    print("Skipping hashtag-platform specificity (missing columns).")
# 3) Is there a strong correlation between shares and views?
if 'shares' in df.columns and 'views' in df.columns:
    corr_shares_views = df['shares'].corr(df['views'])
    print(f"Correlation between shares and views: {corr_shares_views:.3f}")
else:
    print("Missing shares or views.")
No 'duration' column. Skipping short vs long analysis.
Skipping hashtag-platform specificity (missing columns).
Correlation between shares and views: 0.013
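The 0.013 correlation between shares and views is essentially zero: shares do not rise linearly with views in this data. For contrast, Pearson's r can be computed by hand on a strongly linear toy series to see what the statistic measures:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])  # roughly 2*x: strong linear relation

# Pearson r = cov(x, y) / (std(x) * std(y)), written out explicitly
dx, dy = x - x.mean(), y - y.mean()
r_manual = np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))
r_numpy = np.corrcoef(x, y)[0, 1]  # same statistic via NumPy
print(round(r_manual, 3))  # 0.999, near-perfect linear relationship
```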
In [44]:
# Optional: cluster posts by engagement patterns (k-means)
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
cluster_cols = [c for c in ['views','likes','shares','comments','total_engagement','engagement_rate'] if c in df.columns]
if len(cluster_cols) >= 3:
    df_cluster = df[cluster_cols].fillna(0)
    scaler = StandardScaler()
    X = scaler.fit_transform(df_cluster)
    kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
    df['cluster'] = kmeans.labels_
    # show cluster centers in original space
    centers = scaler.inverse_transform(kmeans.cluster_centers_)
    centers_df = pd.DataFrame(centers, columns=cluster_cols)
    print("Cluster centers (approx):")
    display(centers_df)
    # cluster sizes
    print("Cluster sizes:")
    display(df['cluster'].value_counts())
    # simple scatter
    if 'views' in df.columns and 'likes' in df.columns:
        plt.figure(figsize=(10,6))
        sns.scatterplot(x='views', y='likes', hue='cluster', data=df.sample(min(2000, len(df))), palette='deep')
        plt.title("Sample: clusters by views vs likes")
        save_fig("clusters_views_likes.png")
        plt.show()
else:
    print("Not enough numeric columns for clustering.")
Cluster centers (approx):
| | views | likes | shares | comments | total_engagement | engagement_rate |
|---|---|---|---|---|---|---|
| 0 | 3.533589e+06 | 266034.433566 | 50747.774825 | 11798.508392 | 3.862170e+06 | 1.099559 |
| 1 | 1.064918e+06 | 248523.490985 | 49912.040222 | 25402.469256 | 1.388756e+06 | 2.183653 |
| 2 | 3.634600e+06 | 241215.066098 | 51221.570007 | 37401.962331 | 3.964438e+06 | 1.096220 |
Cluster sizes:
cluster
1    2163
0    1433
2    1404
Name: count, dtype: int64
Saved: figs/clusters_views_likes.png
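The scale-then-cluster pattern above (standardize, fit KMeans, inverse-transform the centers back to raw units) can be sketched on synthetic data; the two blob locations below are invented for illustration, not taken from the dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two well-separated synthetic "engagement" blobs in raw units (views, likes)
low_blob = rng.normal(loc=[1e5, 1e3], scale=[1e4, 1e2], size=(50, 2))
high_blob = rng.normal(loc=[4e6, 3e5], scale=[1e5, 1e4], size=(50, 2))
X_raw = np.vstack([low_blob, high_blob])

scaler = StandardScaler()
X = scaler.fit_transform(X_raw)  # cluster in standardized space
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# inverse_transform maps standardized centers back to interpretable raw units
centers = scaler.inverse_transform(km.cluster_centers_)
centers_sorted = sorted(centers.tolist())
print([round(c[0]) for c in centers_sorted])  # roughly [100000, 4000000]
```

Scaling first matters because views are orders of magnitude larger than comments; without it, the Euclidean distances KMeans minimizes would be dominated by views alone.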
In [47]:
# Export logic: Save key figures (we saved earlier in figs/). Create a simple PPTX summarizing 6 required images.
pptx_filename = "alexia_burford_final.pptx"
prs = Presentation()
# Title slide
slide_layout = prs.slide_layouts[0]
slide = prs.slides.add_slide(slide_layout)
title = slide.shapes.title
subtitle = slide.placeholders[1]
title.text = "Social Pulse: Viral Social Media Trends & Engagement Analysis"
subtitle.text = "Brief visualization summary"
# Add each saved figure slide
fig_files = [
"top10_hashtags.png",
"engagement_by_platform.png",
"views_by_content_type.png",
"engagement_level_dist.png",
"avg_views_by_region_bar.png", # if exists
"correlation_heatmap.png",
"clusters_views_likes.png"
]
for f in fig_files:
    path = os.path.join(FIG_DIR, f)
    if os.path.exists(path):
        slide = prs.slides.add_slide(prs.slide_layouts[5])
        title_shape = slide.shapes.title
        title_shape.text = os.path.splitext(f)[0].replace('_', ' ').title()
        left = Inches(1)
        top = Inches(1.6)
        slide.shapes.add_picture(path, left, top, width=Inches(8))
prs.save(pptx_filename)
print("Saved presentation to", pptx_filename)
Saved presentation to alexia_burford_final.pptx
Final notes, limitations & recommendations
Limitations
- Platform bias: If more posts come from one platform (e.g., TikTok or Instagram), platform comparisons may be skewed.
- Missing or inconsistent metadata: Hashtag formatting, region labels, and engagement metrics may not be standardized.
- Possible scraping artifacts: Duplicate posts or extremely high outliers may distort engagement averages.
- Limited context: No guarantee that posts were boosted, sponsored, or artificially inflated.
- Hashtag variability: Hashtags may not reflect topic meaning (e.g., similar tags spelled differently).
Recommendations
- Lean into high-performing platforms: Median engagement is close across platforms here, with YouTube and TikTok slightly ahead; prioritize them, but expect modest differences.
- Use viral hashtags strategically: Incorporate high-frequency trending hashtags, but pair them with platform-specific variations to improve discoverability.
- Invest in video-first content: Short-form video leads median views in this dataset; use fast edits, captions, and audio trends.
- Consider regional patterns: Time posts when target regions are most active, and tailor messaging to high-engagement countries.
- Reproduce high-engagement structures: Outlier posts with exceptional performance may share attributes (music, humor, challenge-based framing); reuse those elements.
- Track engagement rate, not just raw views: High views without interactions signal passive consumption; the near-zero shares-views correlation (0.013) here suggests views alone do not predict sharing, so prioritize strategies that boost likes, comments, and shares.
Optional Research Questions
Question 1: Do short videos outperform longer posts in engagement?
Question 2: Are there hashtags that work better on specific platforms?
Question 3: Is there a strong correlation between shares and views?
In [48]:
# Q1: short vs long (requires 'duration' column in seconds)
if 'duration' in df.columns and 'engagement_rate' in df.columns:
    df['duration'] = pd.to_numeric(df['duration'], errors='coerce')
    df['short_long'] = np.where(df['duration'] <= 30, 'short', 'long')
    grp = df.groupby('short_long')['engagement_rate'].median()
    print("Median engagement rate by duration class:")
    display(grp)
    print("Interpretation: If 'short' median > 'long' median then short videos have higher engagement rate in this dataset.")
else:
    print("No 'duration' column — cannot directly answer short vs long.")
No 'duration' column — cannot directly answer short vs long.
In [49]:
# Q2: top hashtags per platform by average engagement rate
if 'hashtags' in df.columns and 'platform' in df.columns and 'engagement_rate' in df.columns:
    # 'hs' is the list-of-hashtags series built for the Top 10 Hashtags figure
    tmp = pd.DataFrame({'hashtags': hs, 'platform': df['platform'], 'engagement_rate': df['engagement_rate']})
    tmp = tmp.explode('hashtags').dropna(subset=['hashtags'])
    hp = tmp.groupby(['platform', 'hashtags'])['engagement_rate'].mean().reset_index()
    top_hp_by_platform = hp.sort_values(['platform', 'engagement_rate'], ascending=[True, False]).groupby('platform').head(3)
    print("Top 3 hashtags by average engagement rate per platform:")
    display(top_hp_by_platform)
    print("Interpretation: Use these hashtags preferentially on the platforms where they perform best.")
else:
    print("Missing required columns for hashtag-platform specificity analysis.")
Missing required columns for hashtag-platform specificity analysis.